This tutorial will cover the basic steps in analysing scRNA-seq data. The dataset comes from this paper which described the immune microenvironment of breast cancer tumours by performing droplet-based scRNA-seq of CD45+ cells from breast cancer patients. For the purpose of this tutorial, we’ll just be working on a subset of the full dataset to ensure there is enough time to run everything. We’ll be using the Seurat R package to run most of the analysis, which is the most commonly used tool for analysing scRNA-seq data. Run the code chunks below to read in the data and get started. This tutorial is adapted from materials from Dr. Sarah Ennis, previously PhD candidate at University of Galway in CRT in Genomics Data Science program.
options(future.globals.maxSize = 10 * 1024^3)
# load libraries
library(devtools)
library(tidyverse) # install.packages("tidyverse")
library(Seurat) # install.packages('Seurat')
library(clustree) # install.packages("clustree")
library(ggsignif) # install.packages("ggsignif")
library(clusterProfiler) # BiocManager::install("clusterProfiler")
library(org.Hs.eg.db) # BiocManager::install("org.Hs.eg.db")
library(ggrepel) # install.packages("ggrepel")
library(patchwork) # devtools::install_github("thomasp85/patchwork")
library(scPOP) # devtools::install_github('vinay-swamy/scPOP')
library(SingleCellExperiment) # BiocManager::install("SingleCellExperiment")The data is stored as a Seurat object, with 14,875 genes and 9,671
cells. We can find more information about each of the cells in the
meta.data attribute of the Seurat object.
## nCount_RNA nFeature_RNA sample patient tissue replicate
## 7169-BC2-NORMAL 2044 540 BC2-NORMAL-1 BC2 NORMAL 1
## 7170-BC2-NORMAL 3756 872 BC2-NORMAL-1 BC2 NORMAL 1
## 7171-BC2-NORMAL 1555 424 BC2-NORMAL-1 BC2 NORMAL 1
## 7172-BC2-NORMAL 2886 609 BC2-NORMAL-1 BC2 NORMAL 1
## 7173-BC2-NORMAL 2054 699 BC2-NORMAL-1 BC2 NORMAL 1
## 7174-BC2-NORMAL 1105 396 BC2-NORMAL-1 BC2 NORMAL 1
## cluster cell_type_major cell_type_minor
## 7169-BC2-NORMAL 1 T T:CD4+NAIVE
## 7170-BC2-NORMAL 1 T T:CD4+NAIVE
## 7171-BC2-NORMAL 1 T T:CD4+NAIVE
## 7172-BC2-NORMAL 1 T T:CD4+NAIVE
## 7173-BC2-NORMAL 1 T T:CD4+NAIVE
## 7174-BC2-NORMAL 1 T T:CD4+NAIVE
We can also access metadata columns using the $
operator.
##
## BC1-NORMAL-1 BC1-NORMAL-2 BC1-NORMAL-3 BC1-NORMAL-4 BC1-TUMOR-1 BC1-TUMOR-3
## 639 582 952 736 1117 253
## BC1-TUMOR-4 BC2-NORMAL-1 BC2-NORMAL-2 BC2-NORMAL-3 BC2-TUMOR-1 BC2-TUMOR-2
## 1191 379 376 428 567 505
## BC2-TUMOR-3 BC2-TUMOR-4 BC3-NORMAL-2 BC3-NORMAL-4 BC3-TUMOR-1 BC3-TUMOR-3
## 684 223 201 49 68 618
## BC3-TUMOR-5
## 103
##
## NORMAL TUMOR
## BC1 2909 2561
## BC2 1183 1979
## BC3 250 789
The cells come from 19 different samples from 3 different patients. For each patient, there are at least 2 replicates taken from both tumour and normal tissue.
The actual gene expression values are stored in the
assays attribute of the Seurat object.
## 11 x 15 sparse Matrix of class "dgCMatrix"
## [[ suppressing 15 column names '7169-BC2-NORMAL', '7170-BC2-NORMAL', '7171-BC2-NORMAL' ... ]]
##
## ATG2A . . . . . . . . . 3 . . . . .
## ATG2B . . . . . . . . . . . . . . .
## ATG3 . . . . . . . . . . 4 . . . .
## ATG4A . . . . . . 1 . . . . . . . .
## ATG4B . 4 . . . . . . . . . . . . .
## ATG4C . . . . . . . . . . . . . . .
## ATG4D . 4 . . . . . . . . . . . . .
## ATG5 . . . . . . . . 2 . . 4 . . .
## ATG7 . . . . . . . . . . . . . . .
## ATG9A . . . . . . . . . . . . . . .
## ATG9B . . . . . . . . . . . . . . .
The data is currently in the form of counts i.e. number of transcripts mapping to each gene that were detected in each cell.
To ensure poor quality cells are excluded from analysis, some filtering steps are commmonly performed. Cells are usually filtered on the basis of 3 quality control metrics:
nCount_RNA - This is the total number of UMIs which
were detected in a cell. Cells with unusually high counts could
represent doublets (where more than one cell gets caught in a droplet
and tagged with the same barcode) and cells with unusually low counts
could represent empty droplets (where ambient RNA from cells that have
lysed in the cell suspension gets caught in a droplet and tagged with a
barcode).
nFeature_RNA - This is the total number of genes for
which transcripts were detected in each cell. Similar to counts - cells
with high genes could be doublets and cells with low genes could be
empty droplets. Usually, nCount_RNA and
nFeature_RNA are combined into one filter e.g. remove cells
with counts <= 500 & genes <= 200.
percent_mt - This value represents the percentage of
counts in a cell that map to the mitochondrial genome. Cells with high
mitochondrial gene percentage can represent stressed or dying cells,
where the cell has lysed prior to droplet generation and most of the
nuclear mRNA has leaked out but the mitochondrial mRNA will still be
present inside the mitochondria.
Samples are typically QC-ed one by one and filters are decided for each sample by examining the distribution of the QC metrics above to decide on reasonable thresholds.
This dataset has already been QC-ed so no cells should need to be removed but we can still look at the distribution of the QC metrics.
# first calculate the mitchondrial percentage for each cell. MT genes always starte
dat$percent_mt <- PercentageFeatureSet(dat, pattern="^MT.")
# make QC plots
VlnPlot(dat, features = c("nCount_RNA", "nFeature_RNA", "percent_mt")) From these violin plots we can see that there doesn’t appear to be
much outliers for each of the metrics and we can see that cells have
clearly already been filtered on the basis of nCount_RNA
and nFeature_RNA - there are no cells with less than 100
counts and no cells with less than 75 genes. These filters become more
obvious if you view the distribution of metrics in each sample
individually.
All these samples appear to contain good quality cells and no further filtering is necessary. If you were to perform QC steps based on nFeature, nCount, and mithocondrial percentage, below is the code:
dat <- subset(dat, subset = nFeature_RNA > 75 & nCount_RNA > 100 & percent_mt < 20)
VlnPlot(dat, features = c("nCount_RNA", "nFeature_RNA", "percent_mt")) After removing unwanted cells from the dataset, the next step is to normalise the data. By default, Seurat uses a global-scaling normalisation method “LogNormalize” that normalises the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.
The normalised data and the raw counts are both stored in Seurat object but by default, Seurat will use the normalised values for any downstream analysis.
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names '7169-BC2-NORMAL', '7170-BC2-NORMAL', '7171-BC2-NORMAL' ... ]]
##
## A1BG . . . . . . 2.662588 . . .
## A2M . . . . . . . . . .
## A4GALT . . . . . . . . . .
## AAAS . . . . . . . . . .
## AACS . . . . . . . . . .
## AADAC . . . . . . . . . .
## AADAT . . . . . . . . . .
## AAED1 . . . . . . . . . .
## AAGAB . . . . . . . . . .
## AAK1 1.773658 . 3.010257 . . 2.30755 . 2.85455 . 1.403951
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names '7169-BC2-NORMAL', '7170-BC2-NORMAL', '7171-BC2-NORMAL' ... ]]
##
## A1BG . . . . . . 3 . . .
## A2M . . . . . . . . . .
## A4GALT . . . . . . . . . .
## AAAS . . . . . . . . . .
## AACS . . . . . . . . . .
## AADAC . . . . . . . . . .
## AADAT . . . . . . . . . .
## AAED1 . . . . . . . . . .
## AAGAB . . . . . . . . . .
## AAK1 1 . 3 . . 1 . 5 . 1
In order to extract meaningful biological signals from the dataset,
Seurat aims to identify a subset of features (e.g. genes) exhibiting
high variability across cells, and therefore represent heterogeneous
features to prioritise for downstream analysis. Choosing genes solely
based on their log-normalised single-cell variance fails to account for
the mean-variance relationship that is inherent to single-cell RNA-seq.
Therefore, a variance-stabilising transformation is applied to correct
for this before calculating the mean-variance relationship, implemented
in the FindVariableFeatures() function.
# find the 5000 most variable genes
dat <- FindVariableFeatures(dat, selection.method = "vst", nfeatures = 5000)
head(VariableFeatures(dat), 20)## [1] "TPSAB1" "S100A9" "LYZ" "CPA3" "CD74" "CST3"
## [7] "S100A8" "RNASE1" "TPSB2" "HLA.DRA" "CXCL13" "CTSG"
## [13] "FCGBP" "HLA.DRB1" "FCN1" "PCNX3" "CLC" "CCL22"
## [19] "VCAN" "HLA.DPB1"
The most variable genes in the dataset are usually genes that are markers for a specific type of cell. For example, S100A8 and S100A9 (2 of the most variable genes) are markers for monocytes.
Prior to performing dimensionality reduction techniques such as PCA, the dataset is centered and scaled. What this process does is:
Shift the expression of each gene, so that the mean expression across cells is 0.
Scale the expression of each gene, so that the variance across cells is 1.
This step gives equal weight in downstream analyses, so that
highly-expressed genes do not dominate. After performing scaling, the
results are stored in dat[["RNA"]]@scale.data.
# scale all genes, not just HVGs
all.genes <- rownames(dat)
dat <- ScaleData(dat, features = all.genes)## Centering and scaling data matrix
## 7169-BC2-NORMAL 7170-BC2-NORMAL 7171-BC2-NORMAL 7172-BC2-NORMAL
## A1BG -0.18295680 -0.18295680 -0.18295680 -0.18295680
## A2M -0.20956480 -0.20956480 -0.20956480 -0.20956480
## A4GALT -0.02939308 -0.02939308 -0.02939308 -0.02939308
## AAAS -0.10087617 -0.10087617 -0.10087617 -0.10087617
The next task is to visualise the dataset. To do this we need to reduce the dimensionality of the data, as there’s no way we can visualise ~14,000 dimensions. PCA is typically used first to reduce the data to around 15 dimensions and then more complex algorithms such as tSNE or UMAP can be used to reduce to 2 dimensions and visualise the data.
PCA will be performed on the highly variable genes.
# this will take some times to run
dat <- RunPCA(dat, features = VariableFeatures(object = dat), verbose = F)We can check which genes contribute to each of the principal components.
## PC_ 1
## Positive: RPS27, RPS28, MALAT1, CCL5, TRBC2
## Negative: CST3, HLA.DRA, HLA.DRB1, HLA.DPA1, HLA.DPB1
## PC_ 2
## Positive: C1QA, C1QB, C1QC, C3, SLCO2B1
## Negative: NKG7, PFN1, RPS26, GNLY, PRF1
We can also visualise the principal components as scatter plots.
The reason PCA is performed is to compress the dataset into a robust representation of the heterogeneity present in the dataset for downstream clustering, however now we are faced with an issue: how many PCs to include for downstream analyses?
The easiest (and quickest) way to decide this is with an elbow plot - a ranking of principle components based on the percentage of variance explained by each PC.
From this plot we might conclude that taking the top 10 PCs makes the most sense as not much more variance is explained by including any PCs after 10.
Both UMAP and tSNE are forms of graph-based clustering. The first
step in this process is to construct a KNN graph based on the euclidean
distance in PCA space, and refine the edge weights between any two cells
based on the shared overlap in their local neighborhoods (Jaccard
similarity). This step is performed using the
FindNeighbors() function, and takes as input the previously
defined dimensionality of the dataset (we will include the top 10 PCs
here).
## Computing nearest neighbor graph
## Computing SNN
This graph can now be used as input for the RunUMAP()
function. The goal of this algorithm is to learn the underlying manifold
of the data in order to place similar cells together in low-dimensional
space.
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 10:52:05 UMAP embedding parameters a = 0.9922 b = 1.112
## 10:52:05 Read 9657 rows and found 10 numeric columns
## 10:52:06 Using Annoy for neighbor search, n_neighbors = 30
## 10:52:06 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 10:52:08 Writing NN index file to temp file /tmp/RtmpnYMx0U/file3703146e65d88a
## 10:52:08 Searching Annoy index using 1 thread, search_k = 3000
## 10:52:16 Annoy recall = 100%
## 10:52:19 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 10:52:25 Initializing from normalized Laplacian + noise (using RSpectra)
## 10:52:27 Commencing optimization for 500 epochs, with 397338 positive edges
## 10:52:53 Optimization finished
We can layer metadata on top of this plot to see what is driving the clustering of the cells on the graph.
DimPlot(dat, reduction = 'umap', group.by = c('sample', 'patient', 'tissue', 'cell_type_major'), ncol = 2)From the plots above we can see that cells are mostly clustering by cell type but there is still a bit of separation between cells from different patients. Batch effects like this are really common with single-cell data and plenty of tools have been developed to overcome them.
Aside from visual scanning for batche effect, we can use several metrics to measure batch effect in the data. Here, we will discuss NMI and LISI. The NMI (normalized mutual information) measures mutual information or overlap between two label of dataset. A value of 0 represents no mutual information and a value of 1 represents complete overlap. LISI quantifies how well cells from different batches are mixed (or how well cell identities are preserved) in the local neighborhood of a given cell. Values of LISI ranges from 1 to 2. The higher it is the better the batch mixing is. You want to strike a balance, high LISI score for batch while retaining low LISI score for actual biological variance (cell type, cancer type, disease condition, etc.)
dat_sc <- as.SingleCellExperiment(dat)
# Check how much mutual information is shared between batch source (patients) and cell types annotation
nmi_score_seurat <- nmi(dat_sc$patient, dat_sc$cell_type_major)
# Score LISI for batch and for cell type
lisi_score_seurat_patient <- lisi(X = reducedDim(dat_sc,'UMAP'),
meta_data = as.data.frame(colData(dat_sc)),
label_colnames = 'patient')
lisi_score_seurat_celltype <- lisi(X = reducedDim(dat_sc,'UMAP'),
meta_data = as.data.frame(colData(dat_sc)),
label_colnames = 'cell_type_major')
average_lisi_patient <- mean(lisi_score_seurat_patient$patient,
na.rm = T)
average_lisi_celltype <- mean(lisi_score_seurat_celltype$cell_type_major,
na.rm = T)
nmi_score_seurat## [1] 0.2019862
## [1] 1.317605
## [1] 1.137328
Seurat has its own method for integrating cells from different datasets/batches which uses canonical correlation analysis (CCA) to identify ‘anchors’ between the datasets, which are then used to align them.
So far we’ve been treating the dataset as a single batch. To perform batch effect correction on the dataset we need to split the Seurat object into a list of Seurat objects, one for each patient, and process them individually.
# split the dataset into a list of seurat objects (one for each patient)
dat@assays$RNA@data <- dat@assays$RNA@counts # restore counts
patient.list <- SplitObject(dat, split.by = "patient")
# normalise and identify variable features for each batch independently
patient.list <- lapply(X = patient.list, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 4000)
})
# select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = patient.list, verbose = F)We can now use the FindIntegrationAnchors() function to
identify features that are highly correlated across batches (‘anchors’).
This step may take a few minutes to run.
anchors <- FindIntegrationAnchors(object.list = patient.list, anchor.features = features, verbose = F)The anchors are then used as input to the
IntegrateData() function to create a new Seurat object with
an ‘integrated’ assay, which contains the batch corrected values.
## Merging dataset 2 into 1
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Warning: Layer counts isn't present in the assay object; returning NULL
## Merging dataset 1 2 into 3
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Warning: Layer counts isn't present in the assay object; returning NULL
DefaultAssay(dat.combined) <- 'integrated' # set default assay to 'integrated' - the uncorrected values are still present in 'RNA'
dat.combined@assays$integrated@data[1:4, 1:4]## 4 x 4 sparse Matrix of class "dgCMatrix"
## 7169-BC2-NORMAL 7170-BC2-NORMAL 7171-BC2-NORMAL 7172-BC2-NORMAL
## CD74 -0.79940512 1.28189106 -0.59023774 -0.59838800
## HLA.DRA 0.05971967 -0.02431613 -0.04617379 -0.01571815
## TPSAB1 . 0.03063271 . .
## S100A9 . . -0.01671174 1.41857257
You’ll notice the corrected assay contains negative values. These values should not be treated as traditional gene expression values and should not be used for things like differential expression analysis but can be used for constructing an integrated graph.
We can now run the same steps as above on the integrated assay to generate visualisations where the batch effect should be less obvious.
# Run the standard workflow for visualisation and clustering
dat.combined <- ScaleData(dat.combined, verbose = FALSE)
dat.combined <- RunPCA(dat.combined, npcs = 10, verbose = FALSE)
dat.combined <- FindNeighbors(dat.combined, dims = 1:10)## Computing nearest neighbor graph
## Computing SNN
## 11:02:29 UMAP embedding parameters a = 0.9922 b = 1.112
## 11:02:29 Read 9657 rows and found 10 numeric columns
## 11:02:29 Using Annoy for neighbor search, n_neighbors = 30
## 11:02:29 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 11:02:31 Writing NN index file to temp file /tmp/RtmpnYMx0U/file370314530b022b
## 11:02:31 Searching Annoy index using 1 thread, search_k = 3000
## 11:02:39 Annoy recall = 100%
## 11:02:42 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 11:02:46 Initializing from normalized Laplacian + noise (using RSpectra)
## 11:02:47 Commencing optimization for 500 epochs, with 412548 positive edges
## 11:03:09 Optimization finished
# Visualisation
DimPlot(dat.combined, reduction = 'umap', group.by = c('tissue', 'cell_type_major'), ncol = 2, label = T)dat_sc_corrected <- as.SingleCellExperiment(dat.combined)
# Score LISI for batch and for cell type
lisi_score_seurat_patient <- lisi(X = reducedDim(dat_sc_corrected,'UMAP'),
meta_data = as.data.frame(colData(dat_sc_corrected)),
label_colnames = 'patient')
lisi_score_seurat_celltype <- lisi(X = reducedDim(dat_sc_corrected,'UMAP'),
meta_data = as.data.frame(colData(dat_sc_corrected)),
label_colnames = 'cell_type_major')
average_lisi_patient_corrected <- mean(lisi_score_seurat_patient$patient,
na.rm = T)
average_lisi_celltype_corrected <- mean(lisi_score_seurat_celltype$cell_type_major,
na.rm = T)
lisi_scores <- data.frame(
Uncorrected = c(average_lisi_patient, average_lisi_celltype),
Corrected = c(average_lisi_patient_corrected, average_lisi_celltype_corrected),
row.names = c("Patient", "CellType")
)
print(lisi_scores)## Uncorrected Corrected
## Patient 1.317605 1.851048
## CellType 1.137328 1.181753
Because the batch effect was pretty minor to begin with, the plots above look fairly similar to the uncorrected ones but there is slightly more mixing between patients and there is now a separate cluster for the mast cells. Furthermore, LISI score showed considerable improvement from 1.3 to 1.8 while still retaining high clustering of cell types (1.13 -> 1.18). Aside from native CCA method in Seurat, there are many others batch correction methods including Harmony (my personal favorite), LIGER, Scanorama, and BBKNN in R and Python.
A common step in single-cell analysis is to run a community detection
algorithm (such as Louvain or Leiden) to label similar cells as being
part of the same cluster. We can do this in Seurat using the
FindClusters() function which implements the Louvain
algorithm by default. The clusters are detected from the KNN graph which
we built above using the FindNeighbors() function. The
FindClusters() function contains a resolution parameter
that sets the ‘granularity’ of the downstream clustering, with increased
values leading to a greater number of clusters. The cluster labels will
be stored in the meta.data attribute of the Seurat
object.
# identify clusters using multiple different resolution values
dat.combined <- FindClusters(dat.combined, resolution = seq(0.2, 1.2, 0.2)) ## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 9657
## Number of edges: 299195
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9027
## Number of communities: 7
## Elapsed time: 7 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 9657
## Number of edges: 299195
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8556
## Number of communities: 10
## Elapsed time: 6 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 9657
## Number of edges: 299195
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8272
## Number of communities: 11
## Elapsed time: 7 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 9657
## Number of edges: 299195
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8018
## Number of communities: 13
## Elapsed time: 7 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 9657
## Number of edges: 299195
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7827
## Number of communities: 14
## Elapsed time: 6 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 9657
## Number of edges: 299195
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7653
## Number of communities: 18
## Elapsed time: 8 seconds
# visualise
DimPlot(dat.combined, reduction = 'umap', group.by = c('integrated_snn_res.0.2', 'integrated_snn_res.0.4', 'integrated_snn_res.0.6', 'integrated_snn_res.0.8', 'integrated_snn_res.1', 'integrated_snn_res.1.2'), label = T, repel = T, ncol = 2)As the resolution parameter increases, so does the number of clusters
identified. Deciding on the optimal number of clusters to use is an
unsolved question as it’s impossible to know what the ground truth is
but plenty of tools have been developed that provide heuristics to help
make the decision. One of these tools is clustree which
works quite nicely with Seurat objects.
This clustering tree visualises the relationships between clusters at a range of resolutions by looking at how cells move as the clustering resolution is increased. Each cluster forms a node in the tree and edges are constructed by considering the cells in a cluster at a lower resolution that end up in a cluster at the next highest resolution. By connecting clusters in this way we can see how clusters are related to each other, which are clearly distinct and which are unstable.
We can see that some clusters are very distinct and do not change with the value of the resolution parameter. On the other side of the tree we see the tree becomes messier and there are nodes with multiple incoming edges as the resolution parameter is increased. This is a good indication that we have over clustered the data. The optimal resolution for clustering this dataset is probably around 0.2-0.6.
Given that we already have cell type labels for this dataset, we can also look at how the cluster labels overlap with the cell type labels to get an idea of the best value for the resolution parameter.
plot_df <- dat.combined@meta.data %>% pivot_longer(cols = starts_with('integrated_snn_res.'), names_to = 'res')
ggplot(plot_df, aes(x = value, fill = cell_type_major)) +
geom_bar(position = 'fill') +
scale_y_continuous(labels = scales::percent, expand = c(0,0)) +
facet_wrap(~res, scales = 'free') +
theme_minimal()We can see that the resolution 0.2 clusters overlap best with the cell type labels.
This dataset already contains cell type labels but this is obviously not the case if you’re conducting a single-cell analysis from scratch. There are multiple approaches out there for labelling cells but most of them fall into one of the following two categories:
Looking at expression of known marker genes in different clusters and labelling cells accordingly.
Comparing your dataset to a reference dataset from the same tissue and predicting cell type labels based on the reference.
In this case, the authors used a combination of these two approaches. They first clustered cells using an algorithm that they developed in a previous study (Phenograph/Biscuit) and then compared the expression profiles of these clusters to several previously published bulk-sequencing datasets of sorted immune populations. The highest-scoring bulk profile for each cluster was used as a label for that cluster. The plot below shows the original cluster label for each cell and the corresponding cell type annotation for each cluster.
p1 <- DimPlot(dat.combined, reduction = 'umap', group.by = 'integrated_snn_res.0.2', label = T, repel = T) + NoLegend()
p2 <- DimPlot(dat.combined, reduction = 'umap', group.by = 'cell_type_major', label = T, repel = T) + NoLegend()
p1 + p2To confirm these cell type labels the authors then checked that each cluster was expressing the correct marker genes for its given cell type label.
The marker genes they used are shown in the following table:
| Cell type | Marker genes |
|---|---|
| NK-cells | NCAM1, NCR1, NKG2 (KLRK1) |
| cytotoxic T, NK | GNLY, PFN1, GZMA, GZMB, GZMM, GZMH |
| Exhausted T cell, T-regulatory Cell | FOXP3, CTLA4, TIGIT, TNFRSF4, LAG3, PDCD1 |
| T cells | CD8 (CD8A), CD3 (CD3E), CD4 |
| Naive T cells | IL7R |
| B cells | CD19 |
| Mast cells | ENPP3, KIT |
| plasmacytoid DC | IL3RA, LILRA4 |
| Monocytic Lineage | HLA-DR (HLA-DRA), FCGR3A, CD68, ANPEP, ITGAX, CD14, ITGAM, CD33 |
markers <- list(`NK cells` = c('NCAM1', 'NCR1', 'KLRK1'),
`Cytotoxic T/\nNK cells` = c('GNLY', 'PFN1', 'GZMA', 'GZMB', 'GZMM', 'GZMH'),
`Exhausted T/\nTregs` = c('FOXP3', 'CTLA4', 'TIGIT', 'TNFRSF4', 'LAG3', 'PDCD1'),
`T cells` = c('CD8A', 'CD3E', 'CD4'),
`Naive\nT cells` = c('IL7R'),
`B\ncells` = c('CD19'),
`Mast\ncells` = c('ENPP3', 'KIT'),
`pDC` = c('IL3RA', 'LILRA4'),
`Monocytic\nlineage` = c('HLA.DRA', 'FCGR3A', 'CD68', 'ANPEP', 'ITGAX', 'CD14', 'ITGAM', 'CD33'))
DotPlot(dat.combined, features = markers, group.by = 'cell_type_minor', assay = 'RNA') +
scale_colour_viridis_c(option = 'H') +
theme(axis.text = element_text(size = 8),
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.title = element_blank(),
legend.title = element_text(size = 9),
legend.text = element_text(size = 8),
legend.position = 'bottom',
strip.text = element_text(size = 7)) +
coord_cartesian(clip = 'off')## Warning: The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0.
## ℹ Please use the `rows` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
The dotplot above shows that all cell types are expressing the genes you would expect them to which would indicate that the cells are well-labelled and we can move on to downstream analysis using these labels.
To investigate how the immune micro-environment is altered in breast cancer we can compare the cell type composition in normal and tumour tissue. We use proportions rather than absolute number of cells as the number of cells captured for each sample will vary hugely due to technical factors.
# calculate percentages for each cell type in each sample
plot_df <- dat.combined@meta.data %>%
dplyr::select(sample, tissue, cell_type_minor) %>%
group_by(sample, tissue, cell_type_minor) %>%
tally() %>% ungroup() %>% group_by(sample) %>%
mutate(per = n/sum(n))
# plot results
ggplot(plot_df, aes(x = tissue, y = per, fill = tissue)) +
geom_boxplot(outlier.alpha = 0, col = 'black') +
geom_jitter() +
geom_signif(comparisons = list(c("TUMOR", "NORMAL")),
margin_top = 0.01) + # performs wilcoxon test to generate p values
geom_point(aes(y = per * 1.1), alpha = 0) +
scale_y_continuous(labels = function(x) scales::percent(x, accuracy = 0.1)) +
scale_fill_manual(values = c(NORMAL = '#FFB901', TUMOR = '#6966CD'), name = 'Tissue') +
labs(x = NULL, y = 'Percentage of all cells', title = 'Changes in cell type proportions') +
facet_wrap(~cell_type_minor, scales = 'free', nrow = 5) +
coord_cartesian(clip = 'off') +
theme_minimal(base_size = 12) +
theme(axis.text = element_text(colour = 'black'),
strip.text = element_text(margin = margin(b = 10), colour = 'black'),
axis.ticks = element_line(colour = 'gray20'),
legend.position = 'bottom',
plot.title = element_text(hjust = 0.5))We can also look at differential expression of genes to see which
cell types are altered in the tumour micro-environment. The
FindMarkers() function in Seurat can be used to identify
DEGs between conditions for each cell type. This function implements a
wilcoxon test by default. We will have a look of DEGs in cancer vs
normal for each cell type.
deg_df <- data.frame()
for(i in unique(dat.combined$cell_type_major)){
sub <- dat.combined[, dat.combined$cell_type_major == i]
tryCatch({
degs <- FindMarkers(sub, assay = 'RNA', group.by = 'tissue', ident.1 = 'TUMOR', ident.2 = 'NORMAL', features = features)
degs <- degs %>% rownames_to_column(var = 'gene') %>% mutate(cell = i)
deg_df <- rbind(deg_df, degs)
}, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}## ERROR : Cell group 1 has fewer than 3 cells
library(EnhancedVolcano)
# Get unique cell types
cell_types <- unique(deg_df$cell)
volcano_list <- list()
# Loop through each cell type and generate a volcano plot
for (ct in cell_types) {
deg_df_ct <- deg_df %>% filter(cell == ct)
p <- EnhancedVolcano(deg_df_ct,
lab = deg_df_ct$gene,
x = "avg_log2FC",
y = "p_val_adj",
ylab = "-Log10 adjusted p-val",
pCutoff = 0.05,
FCcutoff = log2(1.5),
col = c("black", "black", "black", "red3"),
colAlpha = 1,
subtitle = "",
title = ct,
legendPosition = 'right',
legendLabSize = -1,
legendIconSize = -1,
titleLabSize = 15)
volcano_list[[ct]] <- p
}
for(i in 1:length(volcano_list)){
print(volcano_list[i])
}## $T
##
## $NK
##
## $B
##
## $mDC
##
## $MONOCYTE
##
## $NEUTROPHIL
##
## $NKT
##
## $MACROPHAGE
##
## $MAST
We observed that certain cell types are transcriptionally distinct in the cancer vs normal conditions, such as T, B, and monocyte cells. If we look at some of the underlying genes in T-cell for instance, we see an upregulation of early intermediate T-cell activation signatures (JUN, JUNB, DUSP1, DUSP2, ATF3, KLRC1, etc) as well as heat shock proteins (HSPA6, HSPA1A, and HSPA1B). This is interesting as these genes are commonly activated upon antigen presentation, in this case likely due to tumour neo-antigen presentation. It is also interesting to see the downregulation of innate immune system signature genes (C1QA, C1QC, and EGR1)to be downregulated in monocyte.
From DE analysis, it looks like the T and monocyte cells show the most significant change in gene expression between tumour and normal samples. We can use the ClusterProfiler R package to try and figure out what biological processes these genes might be involved in.
all_genes <- names(rowSums(dat.combined@assays$RNA@counts)[rowSums(dat.combined@assays$RNA@counts) > 0])
DE_genes <- deg_df %>%
filter(cell %in% c("T", "MONOCYTE"),
abs(avg_log2FC) > log2(1.5), p_val_adj <= 0.05)
DE_genes <- split(DE_genes, DE_genes$cell)
go_list <- list()
for(i in 1:length(DE_genes)){
go_list[[i]] <- enrichGO(gene = DE_genes[[i]]$gene, universe = all_genes,
OrgDb = org.Hs.eg.db, keyType = 'SYMBOL', ont = "BP")
names(go_list)[i] <- names(DE_genes)[i]
go_list[[i]] <- go_list[[i]]@result %>% dplyr::filter(p.adjust < 0.05)
}
for (i in 1:length(go_list)){
temp <- go_list[[i]]
temp <- temp %>% dplyr::arrange(p.adjust) %>% head(., 15)
p <- ggplot(data = temp, aes(x = zScore, y = reorder(Description, zScore), fill = -log10(p.adjust))) +
geom_col() +
scale_fill_distiller(palette = 'PuRd', limits = c(0, 12), direction = 1) +
theme_minimal() +
theme(
plot.title.position = 'plot',
plot.title = element_text(hjust = 0.5)
) +
labs(x = "z-score", y = "GO Term", fill = "-log10(p.adjust)", title = paste0("GO Enrichment of ",names(go_list)[i]," cells DEG in tumor vs normal") )
print(p)
}In the follow up session with Mr Sodiq Hameed, you will dive deeper into downstream analyses of scRNA-seq and spatial transcriptomics analysis.